These notes are based on Chapter 7.4-7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.
In principle, the acceptance-rejection method can work in a high-dimensional problem if we can find a PDF \(g\) from which we can efficiently sample and a constant \(C\) giving a tight envelope \(Cg(x)\). It turns out this is a big IF. In 1-D cases, looking at the graph of the target \(f(x)\), we are often able to design a reasonable \(Cg(x)\). Even if the envelope is not very tight, we can wait for some time and obtain enough samples from \(\text{U}(0,Cg(x))\) that fall below \(f(x)\). This is unlikely to be the case in high-dimensional problems. Our visual images in a 4-D or higher space are poor (at least for me). In addition, the number of samples required increases exponentially with dimension, and we may not be able to wait for hours or days.
Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate samples from an approximate distribution that increasingly resembles a target distribution. A sequence of samples are generated by running a Markov chain, which is designed to have the limiting distribution equal to the target distribution. By being content with only approximate distributions, MCMC enables us to handle functions on high-dimensional spaces, which are common in modern complex problems.
Since Monte Carlo simply refers to stochastic simulation, we might just call it Markov chain simulation. But, MCMC has sort of become the standard name everybody using statistical computing understands, perhaps because the acronym is just catchy.
A Markov chain is a sequence of random variables \(X_1,X_2\dots\) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]
In an IID sequence (e.g., Bernoulli process), futures are independent even of the present.
If each \(X_i\) is a discrete random variable, \[P(X_{i+1}=x_{i+1}|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}=x_{i+1}|X_i=x_i).\] In particular, if \(x_{i+1}\) is an element of a finite set that is common for all \(i\geq0\), then the Markov chain can be represented by the transition matrix, which we denote by \(Q\). That is, for all \(i\geq0\), \[Q(j,k) = P(X_{i+1}=k|X_i=j) = P(X_1=k|X_0=j),\] where \(j,k\) are elements of the finite set, which is usually called the state space. For instance, if we start with \(X_0=j\), the probability of \(X_2=k\) is \[\begin{align} P(X_2=k|X_0=j) &= \sum_{s} P(X_2=k,X_1=s|X_0=j)\\ &= \sum_{s} P(X_2=k|X_1=s)\cdot P(X_1=s|X_0=j)\\ &= \sum_{s} Q(j,s)\cdot Q(s,k)\\ &= Q^2(j,k), \end{align}\] where the sum runs over the state space.
As a concrete example, the following figure illustrates a Markov chain with state space of \(\{1,2,3,4,5,6\}\).
Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.2 0.0 0.3 0.5 0.0
## [2,] 0.5 0.0 0.5 0.0 0.0 0.0
## [3,] 0.3 0.0 0.0 0.7 0.0 0.0
## [4,] 0.0 0.0 0.0 0.0 0.0 1.0
## [5,] 0.0 0.0 0.0 0.8 0.0 0.2
## [6,] 0.0 0.0 0.1 0.0 0.9 0.0
The probability of \(X_2=5\) given \(X_0=3\) is 0.15.
Q%*%Q
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.10 0.00 0.10 0.40 0.00 0.40
## [2,] 0.15 0.10 0.00 0.50 0.25 0.00
## [3,] 0.00 0.06 0.00 0.09 0.15 0.70
## [4,] 0.00 0.00 0.10 0.00 0.90 0.00
## [5,] 0.00 0.00 0.02 0.00 0.18 0.80
## [6,] 0.03 0.00 0.00 0.79 0.00 0.18
In general, the probability of \(X_n=k\) given \(X_0=j\) is \(Q^n(j,k)\).
For most MCMC applications, we are interested in continuous random variables, i.e., the state space is uncountable. In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]
In particular, we assume the conditional PDF \(f_{X_{i+1}|X_i}(y|x)\) does not depend on \(i\) and denote it by \(q\). That is, for all \(i\geq0\), \[q(y|x) = f_{X_{i+1}|X_i}(y|x)=f_{X_1|X_0}(y|x),\] where \(x,y\in\mathbb{R}^d\) are states. We call \(q\) the transition density of the (continuous-state) Markov chain.
A state distribution is said to be stationary if its PDF \(f\) satisfies the following global balance equations: \[ f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i} . \]
You may see \(f(x_{i})q(x_{i+1}|x_{i})\) as the joint density for \(X_i\) and \(X_{i+1}\). As usual, if we integrate out \(x_i\), we have the marginal density for \(X_{i+1}\). In this special dependency, however, the marginal for \(X_{i+1}\) turns out to be the same as the one for \(X_i\), namely \(f\).
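As a concrete example with a known stationary distribution (my own illustration, not from the text): the AR(1) chain \(X_{i+1}=\rho X_i + \varepsilon_i\) with \(\varepsilon_i\sim N(0,1-\rho^2)\) has transition density \(q(y|x) = N(\rho x, 1-\rho^2)\), and \(f=N(0,1)\) is stationary. We can check the global balance equation numerically:

```r
# AR(1) transition density: q(y|x) = N(rho*x, 1 - rho^2)
rho <- 0.7
q <- function(y, x) dnorm(y, mean=rho*x, sd=sqrt(1-rho^2))

# Global balance: f(y) = integral of f(x) q(y|x) dx, with f = N(0,1)
y <- 1.5
rhs <- integrate(function(x) dnorm(x)*q(y, x), -Inf, Inf)$value
c(dnorm(y), rhs) # the two values agree
```

The integral is just the convolution of two normals, which is why the standard normal survives one transition step unchanged.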
Given the initial value/state \(x_0\), if we can draw samples from \(q\), then we can simulate a Markov chain and sequentially obtain samples \(x_1,x_2,\dots,x_n\).
The rationale for using a Markov chain as a method for sampling from a desired target distribution \(f\) is this: if we design the chain so that its limiting distribution is \(f\), then, after enough steps \(I\), the samples \(x_{I+1},x_{I+2},\dots\) are approximately distributed according to \(f\).
Samples from the first \(I\) steps (so-called “burn-in” period) are often discarded.
It is easy to approximate the limiting distribution for a finite-state Markov chain. For instance, for \(Q\) defined above, the stationary distribution is
p <- Re(eigen(t(Q))$vectors[,1]) # eigenvector for eigenvalue = 1
p <- p/sum(p) # normalisation
p
## [1] 0.011978917 0.002395783 0.035936751 0.283660757 0.318639195 0.347388596
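In the discrete case, the global balance equations read \(p = pQ\). Here is a self-contained sketch (restating \(Q\) from above) checking that the computed \(p\) indeed satisfies them:

```r
# Restate Q so this check is self-contained
Q <- rbind(c(0.0, 0.2, 0.0, 0.3, 0.5, 0.0),
           c(0.5, 0.0, 0.5, 0.0, 0.0, 0.0),
           c(0.3, 0.0, 0.0, 0.7, 0.0, 0.0),
           c(0.0, 0.0, 0.0, 0.0, 0.0, 1.0),
           c(0.0, 0.0, 0.0, 0.8, 0.0, 0.2),
           c(0.0, 0.0, 0.1, 0.0, 0.9, 0.0))
p <- Re(eigen(t(Q))$vectors[,1]) # eigenvector for eigenvalue = 1
p <- p/sum(p)                    # normalisation
max(abs(p %*% Q - p))            # numerically zero: p is stationary
```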
We can see that \(Q^{500}\) has identical rows, each of which is almost equal to the stationary distribution.
n <- 500
Qn <- Q%^%n # matrix power; %^% is from the expm package
Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [2,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [3,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [4,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [5,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
## [6,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
What does this imply?
p0 <- runif(6)
p0 <- p0/sum(p0) # random initial state distribution
p0%*%Qn
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.01197892 0.002395783 0.03593675 0.2836608 0.3186392 0.3473886
No matter which state we are in at the beginning, we will end up with the same state distribution at time 500 onwards.
A deterministic \(X_0=x_0\) is a special case, e.g., \(p_0 = (0,0,0,1,0,0)\) for \(X_0=4\).
Accordingly, an MCMC estimate of this limiting distribution is
n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:6, 1) # random initial state
for (i in 2:n) {
x[i] <- sample(1:6, 1, prob=Q[x[i-1],]) # MCMC sampling
}
table(x[501:n])/(n-500) # 500 burn-in periods
##
## 1 2 3 4 5 6
## 0.011872936 0.002350175 0.035807904 0.283768884 0.318779390 0.347420710
In contrast, if we sample from the true stationary distribution, an MC estimate is
x <- sample(1:6, n-500, replace=TRUE, prob=p)
table(x)/(n-500)
## x
## 1 2 3 4 5 6
## 0.012007004 0.002414207 0.035835918 0.283634817 0.318143072 0.347964982
As learned in importance sampling, sampling from the true distribution is not necessarily the best thing to do in terms of estimator variance.
The task in MCMC applications is to design a good Markov chain so that the chain converges to the desired target distribution. The following are key design principles.
Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.
Irreducibility ensures that every state \(x\) can be reached from every other state in a finite number of steps \((i<\infty)\). This is important because, in MCMC, the sequence of samples is generated through dependent sampling according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\).
Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.
See Roberts & Rosenthal (2004) for more details.
Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations in order to ensure that the chain converges to our target \(f\). In practice, however, this is tricky and challenging. So, we typically impose a stronger condition called the detailed balance equations: \[f(x)q(y|x) = f(y)q(x|y), \;\forall x,y.\]
By integrating both sides over \(x\), we see the detailed balance equations imply the global balance equations.
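A small self-contained illustration in the discrete case, where detailed balance reads \(p(j)Q(j,k)=p(k)Q(k,j)\): a birth-death chain (one that moves only between neighbouring states) always satisfies it. The specific numbers below are my own toy example.

```r
# Toy birth-death chain on {1,2,3}; such chains satisfy detailed balance
Q <- rbind(c(0.50, 0.50, 0.00),
           c(0.25, 0.50, 0.25),
           c(0.00, 0.50, 0.50))
p <- Re(eigen(t(Q))$vectors[,1])
p <- p/sum(p)      # stationary distribution, here (0.25, 0.5, 0.25)
M <- p * Q         # M[j,k] = p(j) * Q(j,k)
max(abs(M - t(M))) # numerically zero: M is symmetric, detailed balance holds
```

The chain in the earlier 6-state example, by contrast, is not reversible: detailed balance fails there even though global balance holds.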
Here is an example of MCMC for sampling from a bivariate normal distribution \(N(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{bmatrix}1 & 0.8\\0.8 & 1\\\end{bmatrix} .\]
Remember that the joint PDF looks like this:
r <- 0.8
f <- function(x,y) 1/(2*pi*sqrt(1-r^2))*exp(-1/(2*(1-r^2))*(x^2+y^2-2*r*x*y))
XY = meshgrid(seq(-3, 3, length.out=100),
seq(-3, 3, length.out=100))
Z <- f(XY$X, XY$Y)
plot_ly(x=XY$X, y=XY$Y, z=Z, showscale=F) %>% add_surface()
Using the Gibbs sampler, a result may look like the following.
set.seed(123)
n <- 300
m <- 5
x <- rep(0,n)
y <- rep(0,n)
x[1] <- 5*runif(1)-2.5
y[1] <- 5*runif(1)-2.5
for (i in 2:n) {
x[i] <- rnorm(1, mean=r*y[i-1], sd=sqrt(1-r^2))
y[i] <- rnorm(1, mean=r*x[i], sd=sqrt(1-r^2))
par(mar=c(5,5,.1,.1))
plot(x[1], y[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
if (i <= m) {
segments(x[-i],y[-i],x[-1],y[-1],lwd=.5)
} else {
points(x[2:(i-(m-1))], y[2:(i-(m-1))], pch=20, cex=.3)
segments(x[(i-m):(i-2)], y[(i-m):(i-2)],
x[(i-(m-1)):(i-1)], y[(i-(m-1)):(i-1)], lwd=.5)
}
arrows(x[i-1],y[i-1],x[i],y[i], length=0.08, angle=20, lwd=1)
}
Note the dependency of each new point on the previous point; each arrow goes only so far. It is not an IID sample from the target distribution. But, collectively, these points are (approximately) distributed as if sampled from the target distribution.
No burn-in samples are discarded in this specific visualisation. In practice, we usually throw away some of the earliest samples.
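For a quick sanity check, we can run the same two-step Gibbs update for many more iterations (a sketch without the plotting code) and compare the sample correlation with the target value \(r=0.8\):

```r
set.seed(1)
r <- 0.8
n <- 1e5
x <- numeric(n) # both chains start at 0
y <- numeric(n)
for (i in 2:n) {
  x[i] <- rnorm(1, mean=r*y[i-1], sd=sqrt(1-r^2)) # sample x | y
  y[i] <- rnorm(1, mean=r*x[i], sd=sqrt(1-r^2))   # sample y | x
}
cor(x[-(1:500)], y[-(1:500)]) # close to 0.8 after discarding burn-in
```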
In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven design templates. The Metropolis-Hastings algorithm provides a good template. Suppose we have an ergodic Markov chain characterised by a transition density \(q\). The Metropolis-Hastings algorithm tells us how to build upon this Markov chain and design another one that satisfies the detailed balance equations with respect to the target \(f\).
The ergodicity of the new chain is implied by the ergodicity of the original Markov chain with \(q\).
The idea is similar to the acceptance-rejection method.
The acceptance probability is defined as follows: \[\alpha(x,y) = \min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right) .\]
An intuition is the following. If we have a very good proposal density \(q\), which is very similar to the target function \(f\), we will have both ratios \(f(y)/q(y|x)\) and \(f(x)/q(x|y)\) close to 1, leading to \(\alpha\) close to 1. Hence, most of the time, we accept proposals.
The new transition density for \(x_{i+1}\neq x_i\) is \[\tilde{q}(x_{i+1}|x_i) = q(x_{i+1}|x_i)\alpha(x_i,x_{i+1}) .\] Less importantly, the probability of not moving is \[P(X_{i+1}=x_i|X_i=x_i) = 1 - \int_{y\neq x_i} q(y|x_i)\alpha(x_i,y)dy .\]
Let’s make sure that \(\tilde{q}\) satisfies the detailed balance equations: \[f(x)\tilde{q}(y|x) = f(y)\tilde{q}(x|y), \;\forall x,y.\] First, notice that, regardless of \(\tilde{q}\), they trivially hold for \(y=x\). This is why the case \(X_{i+1}=x_i\) is unimportant. Next, for \(y\neq x\), \[\begin{align} f(x)\tilde{q}(y|x) &= f(x)q(y|x)\alpha(x,y)\\ &= f(x)q(y|x)\min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right)\\ &= \min\left(f(x)q(y|x),\; f(y)q(x|y)\right), \end{align}\] which is symmetric in \(x\) and \(y\) (i.e., \(\min(\cdot,\cdot)\) returns the same value even if we swap \(x\) and \(y\)). Hence, \(f(y)\tilde{q}(x|y)\) takes the same value as \(f(x)\tilde{q}(y|x)\), and the detailed balance equations follow.
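To make the template concrete, here is a minimal random-walk Metropolis sketch (my own illustration, not from the text). The proposal \(q(y|x)=N(x,\sigma^2)\) is symmetric, so the \(q\) terms cancel in \(\alpha\); the target, chosen arbitrarily here as \(f(x)\propto e^{-x^4/4}\), only needs to be known up to a normalising constant.

```r
set.seed(123)
f_unnorm <- function(x) exp(-x^4/4) # hypothetical target, unnormalised
n <- 1e5
sigma <- 1     # random-walk step size (a tuning choice)
x <- numeric(n) # chain starts at 0
for (i in 2:n) {
  y <- rnorm(1, mean=x[i-1], sd=sigma)          # propose y ~ N(x, sigma^2)
  alpha <- min(1, f_unnorm(y)/f_unnorm(x[i-1])) # symmetric q cancels
  x[i] <- if (runif(1) < alpha) y else x[i-1]   # accept, or stay put
}
mean(x[-(1:500)]) # target is symmetric about 0, so this is close to 0
```

Because \(\alpha\) depends on \(f\) only through the ratio \(f(y)/f(x)\), the unknown normalising constant never needs to be computed; this is one reason Metropolis-Hastings is so convenient for Bayesian posteriors.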
Gibbs sampling (a.k.a. alternating conditional sampling) is a special case of the Metropolis-Hastings algorithm and is widely used to address multidimensional problems. It is particularly suitable for Bayesian hierarchical models due to the modularised model construction. Suppose \(x\) is a vector in a multidimensional space and is divided into \(d\) sub-vectors \(x=(x^{(1)},\dots,x^{(d)})\). (N.B. Each sub-vector need not be one-dimensional.) Let \(f^{(k)}\) be the conditional PDF of \(x^{(k)}\) \((k\in\{1,\dots,d\})\) given the values of the other sub-vectors: \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] The Gibbs sampler assumes we can sample from each of these conditional PDFs, and proceeds as follows.
You get the idea. \(x_{i+1} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(d)})\) is technically a proposal in the context of the Metropolis-Hastings algorithm, but \(x_{i+1}\) is always accepted in the Gibbs sampler.
Go back to the demo above and see whether the code makes sense.
It is straightforward to recognise \(f\) as stationary under the Gibbs sampling. Consider sampling \(x^{(k)}\) using \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] First, the marginal \(f^{(-k)}(x^{(-k)})\) where \(x^{(-k)} = (x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)})\) remains invariant because no sampling occurs for \(x^{(-k)}\). Next, by definition, the Gibbs sampler draws a sample of \(x^{(k)}\) from the correct conditional distribution \(f^{(k)}(x^{(k)}|x^{(-k)})\). Thus, the joint PDF \[f(x) = f^{(k)}(x^{(k)}|x^{(-k)})\cdot f^{(-k)}(x^{(-k)})\] remains invariant under the Gibbs sampling.
How about the ergodicity? Here is a sufficient condition for irreducibility that often holds when we use the Gibbs sampler. Let \(\Omega\) be the support of the joint distribution \(f(x)\). If the support of \(f^{(k)}(x^{(k)}|x^{(-k)})\) coincides with the slice of \(\Omega\) at an arbitrary \(x^{(-k)}\), then every value of \(x^{(k)}\) can be sampled from any \(x\). This is clearly the case in the above demo, where the support of each conditional normal distribution is the entire real line. As for aperiodicity, at this level, you may take it for granted in almost any non-trivial application.
Looking at a sample \(x_{i+1}^{(k)}\) drawn from \[f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}),\] we may put \(x_{i+1}^{(k)}\) in a full-length vector \(x\) and index it as \(j+1\). That is, \[\begin{align} \vdots&\\ x_{j} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ x_{j+1} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i+1}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ \vdots& \end{align}\]
Note that the transition densities of \(x_j\to x_{j+1}\) and \(x_{j+1}\to x_j\) are both given by the conditional PDF for \(x^{(k)}\), \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\\ q(x_j|x_{j+1}) &= f^{(k)}(x_{i}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i}^{(k)}|x^{(-k)}), \end{align}\] where we use \(x^{(-k)} = x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}\). Now, it follows that \[\begin{align} \alpha(x_{j},x_{j+1}) &= \min\left(1,\; \frac{f(x_{j+1})q(x_{j}|x_{j+1})}{f(x_{j})q(x_{j+1}|x_{j})}\right)\\ &= \min\left(1,\; \frac{f^{(k)}(x_{i+1}^{(k)}|x^{(-k)}) f^{(-k)}(x^{(-k)})f^{(k)}(x_{i}^{(k)}|x^{(-k)})}{f^{(k)}(x_{i}^{(k)}|x^{(-k)}) f^{(-k)}(x^{(-k)})f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})}\right)\\ &= \min(1, 1)\\ &= 1 \end{align}\]
Imagine you try to model the number of customers at each of two bars in Melbourne using a Poisson distribution, namely \(\text{pois}(\lambda_1)\) for Bar 1 and \(\text{pois}(\lambda_2)\) for Bar 2. Based on other information, you think \(\lambda_1\) and \(\lambda_2\) are distinct but somehow related. So, you assume both \(\lambda_1\) and \(\lambda_2\) to be distributed as \(\text{exp}(\theta)\). Finally, you complete the model by simply assuming \(\theta \sim \text{exp}(1)\).
Suppose you are going to collect \(m_1\) samples from Bar 1 and \(m_2\) samples from Bar 2. Then, your model is:
\[\begin{gather} X_1, X_2, \dots, X_{m_1} \sim \text{pois}(\lambda_1)\\ X_{m_1+1}, X_{m_1+2}, \dots, X_{m_1+m_2} \sim \text{pois}(\lambda_2)\\ \lambda_1, \lambda_2 \sim \text{exp}(\theta)\\ \theta \sim \text{exp}(1) \end{gather}\] (All random variables are conditionally independent of each other.)
You naturally expect to learn something about the average numbers of customers at Bar 1 and Bar 2 (\(\lambda_1\) and \(\lambda_2\)) and how they are related.
Given the collected data \[\mathbf{x}=(\mathbf{x}_1,\mathbf{x}_2)=(x_1,\dots,x_{m_1},x_{m_1+1},\dots,x_{m_1+m_2}),\] you derive \(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\), a joint PDF of \((\lambda_1, \lambda_2, \theta)\) conditional on \(\mathbf{x}\).
\[\begin{align} f(\lambda_1,\lambda_2,\theta|\mathbf{x}) &= \frac{f(\mathbf{x},\lambda_1,\lambda_2,\theta)}{f(\mathbf{x})}\\ &\propto f(\mathbf{x},\lambda_1,\lambda_2,\theta)\\ &= f(\mathbf{x}_1|\lambda_1)\cdot f(\mathbf{x}_2|\lambda_2)\cdot f(\lambda_1|\theta)\cdot f(\lambda_2|\theta)\cdot f(\theta)\\ &= \prod_{i=1}^{m_1}f_1(x_i|\lambda_1) \cdot \prod_{j=1}^{m_2}f_2(x_{m_1+j}|\lambda_2) \cdot f(\lambda_1|\theta) \cdot f(\lambda_2|\theta) \cdot f(\theta)\\ &= \prod_{i=1}^{m_1}\frac{e^{-\lambda_1}\lambda_1^{x_i}}{x_i!}\cdot\prod_{j=1}^{m_2}\frac{e^{-\lambda_2}\lambda_2^{x_{m_1+j}}}{x_{m_1+j}!}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &\propto e^{-m_1\lambda_1}\lambda_1^{\sum_{i=1}^{m_1}x_i}\cdot e^{-m_2\lambda_2}\lambda_2^{\sum_{j=1}^{m_2}x_{m_1+j}}\cdot\theta e^{-\theta\lambda_1}\cdot\theta e^{-\theta\lambda_2}\cdot e^{-\theta}\\ &= \theta^2 \cdot e^{-\theta} \cdot e^{-(\theta+m_1)\lambda_1}\cdot e^{-(\theta+m_2)\lambda_2} \cdot \lambda_1^{\sum_{i=1}^{m_1}x_i} \cdot \lambda_2^{\sum_{j=1}^{m_2}x_{m_1+j}} \end{align}\]
\(f(\lambda_1, \lambda_2, \theta|\mathbf{x})\) does not look like any familiar PDF.
Let’s implement a Gibbs sampler to sample from this joint PDF. First, you need to derive the following full conditionals.
\[\begin{align} f(\lambda_1|\lambda_2,\theta,\mathbf{x}) &\propto \lambda_1^{\sum_{i=1}^{m_1}x_i} e^{-(\theta+m_1)\lambda_1}, \text{ i.e., } \text{Gamma}\left(1+\sum_{i=1}^{m_1}x_i, \theta+m_1\right)\\ f(\lambda_2|\lambda_1,\theta,\mathbf{x}) &\propto \lambda_2^{\sum_{j=1}^{m_2}x_{m_1+j}} e^{-(\theta+m_2)\lambda_2}, \text{ i.e., } \text{Gamma}\left(1+\sum_{j=1}^{m_2}x_{m_1+j}, \theta+m_2\right)\\ f(\theta|\lambda_1,\lambda_2,\mathbf{x}) &\propto \theta^2 e^{-(1+\lambda_1+\lambda_2)\theta}, \text{ i.e., } \text{Gamma}(3, 1+\lambda_1+\lambda_2). \end{align}\]
Then, you cycle through them for \(n\) rounds of sampling and return an \(n\times 3\) matrix.
sampler <- function(n, x1, x2) {
set.seed(90045)
m1 <- length(x1)
m2 <- length(x2)
s1 <- sum(x1)
s2 <- sum(x2)
D <- matrix(0, n, 3)
for (i in 2:n) {
D[i,1] <- rgamma(1, 1+s1, m1+D[i-1,3])
D[i,2] <- rgamma(1, 1+s2, m2+D[i-1,3])
D[i,3] <- rgamma(1, 3, 1+D[i,1]+D[i,2])
}
return(D)
}
For example,
x1 <- c(10,9,6,19,8,13,15,7,11)
x2 <- c(14,10,11,9,9,6,13,19,10,8,6,13,8,15,21,7,12,11)
n <- 1e6
D <- sampler(n, x1, x2)
burn_in <- floor(0.05*n) # throwing away 1st 5% of samples
D <- D[(burn_in+1):n,]
Using the samples D, you can approximate many quantities. For example,

- the mean of \(\lambda_1\),
- the mean of \(\lambda_2\), and
- the correlation coefficient between \(\lambda_1\) and \(\lambda_2\).
\[\begin{align} \mu_1 &= \mathbb{E}[\lambda_1|\mathbf{x}]\\ \mu_2 &= \mathbb{E}[\lambda_2|\mathbf{x}]\\ \sigma^2_1 &= \mathbb{E}[(\lambda_1-\mu_1)^2|\mathbf{x}]\\ \sigma^2_2 &= \mathbb{E}[(\lambda_2-\mu_2)^2|\mathbf{x}]\\ \text{Cov}(\lambda_1,\lambda_2) &= \mathbb{E}[(\lambda_1-\mu_1)(\lambda_2-\mu_2)|\mathbf{x}]\\ \rho &= \frac{\text{Cov}(\lambda_1,\lambda_2)}{\sigma_1\sigma_2} \end{align}\]
mu1 <- mean(D[,1])
mu2 <- mean(D[,2])
sig1 <- mean((D[,1]-mu1)^2)
sig2 <- mean((D[,2]-mu2)^2)
cov <- mean((D[,1]-mu1)*(D[,2]-mu2))
rho <- cov/sqrt(sig1*sig2)
Results are \[\hat{\mu}_1 = 10.84,\; \hat{\mu}_2 = 11.2,\; \hat{\rho} = 0.0047.\] The two bars seem to have very similar average numbers of customers. As far as the correlation is concerned, however, they appear unrelated.